Yule–Simon distribution (yulesimon)#
The Yule–Simon distribution is a heavy-tailed discrete distribution on the positive integers. It’s a classic model for rich-get-richer / preferential attachment dynamics: a few categories become very large, while most stay small.
Learning goals#
Recognize when Yule–Simon is a reasonable model for count data with a power-law tail.
Write the PMF/CDF in terms of Beta/Gamma functions (and understand the tail behavior).
Know which moments exist (and when the mean/variance/skewness/kurtosis are infinite).
Derive the expectation and variance via a Beta–geometric mixture representation.
Implement sampling with NumPy only and validate it with Monte Carlo simulation.
Use `scipy.stats.yulesimon` and `scipy.stats.fit` for evaluation, simulation, and parameter estimation.
Prerequisites#
Discrete probability (PMF/CDF), expectation, variance
Comfort with logs and basic calculus
Familiarity with the Gamma/Beta functions is helpful (but introduced as needed)
Notebook roadmap#
Title & Classification
Intuition & Motivation
Formal Definition
Moments & Properties
Parameter Interpretation
Derivations (Expectation, Variance, Likelihood)
Sampling & Simulation (NumPy-only)
Visualization (PMF, CDF, Monte Carlo)
SciPy Integration
Statistical Use Cases
Pitfalls
Summary
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from scipy import optimize, special, stats
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=6, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
import sys
import scipy
import plotly
print("Python:", sys.version.split()[0])
print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)
print("Plotly:", plotly.__version__)
print("Seed:", SEED)
Python: 3.12.9
NumPy: 1.26.2
SciPy: 1.15.0
Plotly: 6.5.2
Seed: 7
1) Title & Classification#
Name: `yulesimon` (Yule–Simon distribution)
Type: Discrete
Support (SciPy `loc=0`): (k \in {1,2,3,\dots}). With a location shift `loc`, the support becomes (k \in {\mathrm{loc}+1, \mathrm{loc}+2, \dots}).
Parameter space: (\alpha > 0) (shape)
Notation: (K \sim \mathrm{YuleSimon}(\alpha)). SciPy uses the shape parameter name `alpha`: `scipy.stats.yulesimon(alpha, loc=0)`.
2) Intuition & Motivation#
What this distribution models#
The Yule–Simon distribution models positive integer counts with a power-law (heavy) tail:
many observations are small (1, 2, 3, …)
a non-negligible fraction are very large
This pattern often comes from cumulative advantage dynamics:
items that are already frequent are more likely to be observed again.
Typical real-world use cases#
Word frequencies in text (a few words are very common, most are rare)
Citations of scientific papers
In-degree in networks with preferential attachment (links to popular pages keep accumulating)
Popularity / sales counts (blockbusters vs the long tail)
Relations to other distributions#
Power laws / Pareto / Zipf: Yule–Simon is a discrete distribution with a power-law tail.
Mixture of geometric distributions: if (P \sim \mathrm{Beta}(\alpha,1)) and (K\mid P \sim \mathrm{Geometric}(P)) on ({1,2,\dots}), then (K) is Yule–Simon.
Simon’s model (preferential attachment with innovation) yields Yule–Simon-like frequency distributions.
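The mixture identity can be checked numerically: integrating the geometric PMF against the Beta((\alpha),1) density should reproduce (\alpha B(k,\alpha+1)). A minimal sketch using `scipy.integrate.quad` (the helper name `mixture_pmf` is ad hoc):

```python
import numpy as np
from scipy import integrate, special

alpha = 2.5

def mixture_pmf(k: int, alpha: float) -> float:
    """P(K=k) via the Beta-geometric mixture: integrate the Geometric(p)
    PMF p*(1-p)**(k-1) against the Beta(alpha, 1) density alpha*p**(alpha-1)."""
    integrand = lambda p: p * (1.0 - p) ** (k - 1) * alpha * p ** (alpha - 1.0)
    val, _ = integrate.quad(integrand, 0.0, 1.0)
    return val

for k in (1, 2, 5, 20):
    closed_form = alpha * special.beta(k, alpha + 1.0)
    print(k, mixture_pmf(k, alpha), closed_form)
```

The two columns agree to quadrature precision, confirming the closed form derived in Section 6.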
3) Formal Definition#
Let (K \sim \mathrm{YuleSimon}(\alpha)) with (\alpha>0) and support (k=1,2,\dots).
PMF#
A standard closed form uses the beta function (B(\cdot,\cdot)):
[ \mathbb{P}(K=k\mid \alpha) = \alpha,B(k,\alpha+1), \qquad k=1,2,\dots ]
Using (B(x,y)=\frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}), this is equivalent to
[ \mathbb{P}(K=k\mid \alpha) = \alpha,\frac{\Gamma(k)\Gamma(\alpha+1)}{\Gamma(k+\alpha+1)}. ]
CDF (and survival function)#
Because this is discrete, for real (x),
[ F(x)=\mathbb{P}(K \le x)=\mathbb{P}(K \le \lfloor x\rfloor). ]
For integer (k\ge 1), a convenient closed form is
[ F(k)=1-\mathbb{P}(K>k) =1-\alpha,B(k+1,\alpha) =1-k,B(k,\alpha+1). ]
We will use the survival function form for numerical stability when (F(k)) is very close to 1.
import math
def validate_alpha(alpha: float) -> float:
alpha = float(alpha)
if not np.isfinite(alpha) or alpha <= 0:
raise ValueError(f"alpha must be positive and finite; got {alpha!r}")
return alpha
_lgamma = np.vectorize(math.lgamma, otypes=[float])
def yulesimon_logpmf(k, alpha: float) -> np.ndarray:
'''Log-PMF of Yule–Simon(alpha) at integer k >= 1.
Implemented with Python's `math.lgamma` (no SciPy special functions).
'''
alpha = validate_alpha(alpha)
k = np.asarray(k)
kf = k.astype(float)
out = np.full_like(kf, fill_value=-np.inf, dtype=float)
mask = (kf >= 1) & (kf == np.floor(kf))
if np.any(mask):
kv = kf[mask]
out[mask] = (
np.log(alpha)
+ _lgamma(alpha + 1.0)
+ _lgamma(kv)
- _lgamma(kv + alpha + 1.0)
)
return out
def yulesimon_pmf(k, alpha: float) -> np.ndarray:
'''PMF of Yule–Simon(alpha) at k (returns 0 outside the support).'''
return np.exp(yulesimon_logpmf(k, alpha))
def yulesimon_logsf(k, alpha: float) -> np.ndarray:
'''Log survival function log P(K > k) for integer k >= 0.
Uses: P(K > k) = alpha * B(k+1, alpha).
'''
alpha = validate_alpha(alpha)
k = np.asarray(k)
kf = k.astype(float)
out = np.full_like(kf, fill_value=-np.inf, dtype=float)
# For any k < 1, P(K > k) = 1 because the support starts at 1.
out[kf < 1] = 0.0
mask = (kf >= 1) & (kf == np.floor(kf))
if np.any(mask):
kv = kf[mask]
out[mask] = (
np.log(alpha)
+ _lgamma(kv + 1.0)
+ _lgamma(alpha)
- _lgamma(kv + alpha + 1.0)
)
return out
def yulesimon_sf(k, alpha: float) -> np.ndarray:
return np.exp(yulesimon_logsf(k, alpha))
def yulesimon_cdf(x, alpha: float) -> np.ndarray:
'''CDF evaluated at real x via floor(x).'''
alpha = validate_alpha(alpha)
x = np.asarray(x)
k = np.floor(x)
out = np.zeros_like(k, dtype=float)
mask = k >= 1
if np.any(mask):
out[mask] = 1.0 - yulesimon_sf(k[mask], alpha)
return out
# Quick sanity checks vs SciPy
alpha = 2.5
k = np.arange(1, 11)
pmf_np = yulesimon_pmf(k, alpha)
pmf_sp = stats.yulesimon.pmf(k, alpha)
print("max |pmf numpy - scipy|:", float(np.max(np.abs(pmf_np - pmf_sp))))
x = np.array([-3.0, 0.0, 0.9, 1.0, 2.0, 5.0, 10.0])
cdf_np = yulesimon_cdf(x, alpha)
cdf_sp = stats.yulesimon.cdf(x, alpha)
print("max |cdf numpy - scipy|:", float(np.max(np.abs(cdf_np - cdf_sp))))
# PMF should sum to 1; we can check this by truncating and adding the exact tail mass.
K = 2000
mass_trunc = yulesimon_pmf(np.arange(1, K + 1), alpha).sum()
tail_mass = yulesimon_sf(K, alpha)
print("mass_trunc + tail_mass:", float(mass_trunc + tail_mass))
max |pmf numpy - scipy|: 7.771561172376096e-16
max |cdf numpy - scipy|: 1.1102230246251565e-16
mass_trunc + tail_mass: 0.9999999999999994
4) Moments & Properties#
Existence of moments#
The Yule–Simon distribution is heavy-tailed. In fact, the (m)-th raw moment exists iff
[ \mathbb{E}[K^m] < \infty \quad \Longleftrightarrow \quad \alpha > m. ]
So:
mean exists only for (\alpha>1)
variance exists only for (\alpha>2)
skewness exists only for (\alpha>3)
(excess) kurtosis exists only for (\alpha>4)
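These thresholds can be seen numerically: below the threshold, truncated moment sums keep climbing as the truncation point grows instead of converging. A small illustration (the helper name `truncated_mean` is ad hoc):

```python
import numpy as np
from scipy import stats

def truncated_mean(alpha: float, K: int) -> float:
    """Partial sum of k * P(K=k) for k = 1..K."""
    k = np.arange(1, K + 1)
    return float(np.sum(k * stats.yulesimon.pmf(k, alpha)))

# alpha = 3.0: mean exists (= 1.5), so partial sums settle quickly.
# alpha = 0.8: mean is infinite, so partial sums keep growing with K.
for K in (10**2, 10**4, 10**6):
    print(K, truncated_mean(3.0, K), truncated_mean(0.8, K))
```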
Mean, variance, skewness, kurtosis#
For (\alpha>1),
[ \mathbb{E}[K] = \frac{\alpha}{\alpha-1}. ]
For (\alpha>2),
[ \mathrm{Var}(K)=\frac{\alpha^2}{(\alpha-1)^2(\alpha-2)}. ]
For (\alpha>3), the skewness is
[ \gamma_1 = \frac{(\alpha+1)^2\sqrt{\alpha-2}}{\alpha(\alpha-3)}. ]
For (\alpha>4), the excess kurtosis is
[ \gamma_2 = \frac{\alpha^4 + 7\alpha^3 - 9\alpha^2 - 13\alpha - 22}{\alpha(\alpha-3)(\alpha-4)}. ]
(The kurtosis is (\gamma_2 + 3).)
PGF / MGF / characteristic function#
A useful analytic object for discrete distributions is the probability generating function (PGF)
[ G(z)=\mathbb{E}[z^K], \qquad |z|<1. ]
For Yule–Simon,
[ G(z)=\frac{\alpha z}{\alpha+1},{}_2F_1(1,1;\alpha+2;z), ]
where ({}_2F_1) is the Gauss hypergeometric function.
The MGF is (M(t)=\mathbb{E}[e^{tK}]=G(e^t)), which is finite for (t<0) (since (e^t<1)). It diverges for any (t>0) because of the power-law tail.
The characteristic function is (\varphi(t)=\mathbb{E}[e^{itK}]=G(e^{it})). It exists for all real (t), though numerical evaluation at (t=0) should be handled carefully.
Entropy#
The Shannon entropy is
[ H(K)=-\sum_{k=1}^\infty p(k)\log p(k). ]
It is finite for all (\alpha>0), but (to my knowledge) there is no simple closed form; we can evaluate it numerically.
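As a sketch, the entropy can be approximated by truncating the sum at a large (K): the neglected terms decay like (k^{-(\alpha+1)}\log k), so the tail contribution is negligible for moderate (\alpha). The helper name below is ad hoc:

```python
import numpy as np
from scipy import stats

def yulesimon_entropy_trunc(alpha: float, K: int = 200_000) -> float:
    """Truncated-sum approximation of H(K) = -sum_k p(k) log p(k)."""
    k = np.arange(1, K + 1)
    logp = stats.yulesimon.logpmf(k, alpha)
    return float(-np.sum(np.exp(logp) * logp))

alpha = 2.5
print(yulesimon_entropy_trunc(alpha))          # ~= SciPy's entropy value
print(float(stats.yulesimon.entropy(alpha)))
```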
def yulesimon_theoretical_stats(alpha: float) -> dict:
alpha = validate_alpha(alpha)
mean = np.inf if alpha <= 1 else alpha / (alpha - 1)
var = np.inf if alpha <= 2 else alpha**2 / ((alpha - 1) ** 2 * (alpha - 2))
skew = (
np.inf
if alpha <= 3
else ((alpha + 1) ** 2 * np.sqrt(alpha - 2)) / (alpha * (alpha - 3))
)
excess_kurt = (
np.inf
if alpha <= 4
else (alpha**4 + 7 * alpha**3 - 9 * alpha**2 - 13 * alpha - 22)
/ (alpha * (alpha - 3) * (alpha - 4))
)
kurt = np.inf if alpha <= 4 else excess_kurt + 3
return {
"alpha": alpha,
"mean": mean,
"var": var,
"skew": skew,
"kurt": kurt,
"excess_kurt": excess_kurt,
}
def yulesimon_pgf(z, alpha: float):
'''Probability generating function G(z)=E[z^K] for |z|<1.'''
alpha = validate_alpha(alpha)
z = np.asarray(z, dtype=complex)
return (alpha * z / (alpha + 1.0)) * special.hyp2f1(1.0, 1.0, alpha + 2.0, z)
def yulesimon_mgf(t, alpha: float):
'''MGF M(t)=E[e^{tK}] (finite for t<0).'''
t = np.asarray(t, dtype=float)
if np.any(t >= 0):
raise ValueError("MGF diverges for t >= 0 (heavy tail); use t < 0")
return yulesimon_pgf(np.exp(t), alpha)
def yulesimon_cf(t, alpha: float):
'''Characteristic function phi(t)=E[e^{itK}].'''
t = np.asarray(t, dtype=float)
out = yulesimon_pgf(np.exp(1j * t), alpha)
out = np.where(t == 0, 1.0 + 0j, out)
return out
for a in [1.5, 2.5, 5.0]:
mvsk = stats.yulesimon.stats(a, moments="mvsk")
ent = stats.yulesimon.entropy(a)
print()
print(f"alpha={a}")
print("theory (closed form when finite):", yulesimon_theoretical_stats(a))
print("SciPy stats (mean, var, skew, kurt):", tuple(float(x) for x in mvsk))
print("SciPy entropy:", float(ent))
# MGF spot-check for a negative t
alpha = 3.5
t = -0.3
print()
print("MGF(t) at t=-0.3 (complex rounding):", complex(yulesimon_mgf(t, alpha)))
print("CF(t) at t=0.7:", complex(yulesimon_cf(0.7, alpha)))
alpha=1.5
theory (closed form when finite): {'alpha': 1.5, 'mean': 3.0, 'var': inf, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}
SciPy stats (mean, var, skew, kurt): (3.0, inf, nan, nan)
SciPy entropy: 1.503896885539161
alpha=2.5
theory (closed form when finite): {'alpha': 2.5, 'mean': 1.6666666666666667, 'var': 5.555555555555555, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}
SciPy stats (mean, var, skew, kurt): (1.6666666666666667, 5.555555555555555, inf, inf)
SciPy entropy: 1.0232863517009578
alpha=5.0
theory (closed form when finite): {'alpha': 5.0, 'mean': 1.25, 'var': 0.5208333333333334, 'skew': 6.235382907247958, 'kurt': 121.8, 'excess_kurt': 118.8}
SciPy stats (mean, var, skew, kurt): (1.25, 0.5208333333333334, 6.235382907247958, 118.8)
SciPy entropy: 0.6061305592752749
MGF(t) at t=-0.3 (complex rounding): (0.6798440777897914+0j)
CF(t) at t=0.7: (0.5734322687509957+0.6747782631823974j)
/home/tempa/miniconda3/lib/python3.12/site-packages/scipy/stats/_discrete_distns.py:1834: RuntimeWarning:
invalid value encountered in sqrt
/home/tempa/miniconda3/lib/python3.12/site-packages/numpy/lib/function_base.py:2410: RuntimeWarning:
expect(): sum did not converge
/home/tempa/miniconda3/lib/python3.12/site-packages/numpy/lib/function_base.py:2455: RuntimeWarning:
expect(): sum did not converge
5) Parameter Interpretation#
The single shape parameter (\alpha) controls tail heaviness.
Smaller (\alpha) (\Rightarrow) heavier tail (large counts become more likely).
Larger (\alpha) (\Rightarrow) faster decay and more mass near 1.
Two useful facts:
Tail exponent: as (k\to\infty),
[ \mathbb{P}(K=k) \sim \alpha,\Gamma(\alpha+1),k^{-(\alpha+1)}. ]
So (\alpha) directly controls the power-law exponent.
Moment thresholds: the mean/variance/etc exist only when (\alpha) is above the corresponding order. This matters in practice: for (\alpha \le 2), the empirical variance can be dominated by rare extreme samples.
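A quick simulation (illustrative sizes, fixed seed) shows how unstable the sample variance becomes once (\alpha \le 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 2_000, 20

results = {}
for a in (1.5, 5.0):  # infinite vs finite variance
    results[a] = np.array([
        stats.yulesimon.rvs(a, size=n, random_state=rng).var()
        for _ in range(reps)
    ])

# alpha=1.5: variance is infinite, so the sample variance is dominated by
# the largest observations and swings wildly between repetitions.
# alpha=5.0: variance is ~0.52, and the estimates are far more stable.
for a, v in results.items():
    print(a, "min var:", v.min(), "max var:", v.max())
```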
# Shape changes as alpha varies
alphas = [0.7, 1.2, 2.5, 5.0, 10.0]
k = np.arange(1, 60)
fig = go.Figure()
for a in alphas:
fig.add_trace(
go.Scatter(
x=k,
y=stats.yulesimon.pmf(k, a),
mode="lines+markers",
name=f"alpha={a}",
)
)
fig.update_layout(
title="Yule–Simon PMF for different α (log–log view)",
xaxis_title="k",
yaxis_title="P(K=k)",
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()
6) Derivations#
A very convenient representation is a Beta–geometric mixture.
6.1 PMF via a mixture#
Let
(P \sim \mathrm{Beta}(\alpha, 1)) with density (f(p)=\alpha p^{\alpha-1}) on (0<p<1)
(K\mid P=p \sim \mathrm{Geometric}(p)) on ({1,2,\dots}), i.e. (\mathbb{P}(K=k\mid p)=p(1-p)^{k-1})
Then the marginal PMF is
[ \mathbb{P}(K=k) = \int_0^1 p(1-p)^{k-1},\alpha p^{\alpha-1},dp = \alpha\int_0^1 p^{\alpha}(1-p)^{k-1},dp = \alpha,B(k,\alpha+1). ]
6.2 Expectation#
Using the law of total expectation and the fact that (\mathbb{E}[K\mid P=p]=1/p),
[ \mathbb{E}[K] = \mathbb{E}[\mathbb{E}[K\mid P]] = \mathbb{E}\left[\frac{1}{P}\right] = \alpha\int_0^1 p^{\alpha-2},dp = \frac{\alpha}{\alpha-1}, ]
which is finite only for (\alpha>1).
6.3 Variance#
Using the law of total variance,
[ \mathrm{Var}(K)=\mathbb{E}[\mathrm{Var}(K\mid P)] + \mathrm{Var}(\mathbb{E}[K\mid P]). ]
For the geometric distribution on ({1,2,\dots}), (\mathrm{Var}(K\mid p)=(1-p)/p^2) and (\mathbb{E}[K\mid p]=1/p). So
[ \mathrm{Var}(K)=\mathbb{E}\left[\frac{1-P}{P^2}\right] + \mathrm{Var}\left(\frac{1}{P}\right) = \frac{\alpha^2}{(\alpha-1)^2(\alpha-2)}, ]
finite only for (\alpha>2).
6.4 Likelihood#
Given i.i.d. data (k_1,\dots,k_n) (each (k_i\in{1,2,\dots})), the log-likelihood is
[ \ell(\alpha) = \sum_{i=1}^n \log \mathbb{P}(K=k_i\mid \alpha) = n\log\alpha + n\log\Gamma(\alpha+1) + \sum_{i=1}^n \log\Gamma(k_i) - \sum_{i=1}^n \log\Gamma(k_i+\alpha+1). ]
Differentiating gives the score equation (using the digamma function (\psi=\Gamma’/\Gamma)):
[ \frac{d\ell}{d\alpha} = \frac{n}{\alpha} + n\psi(\alpha+1) - \sum_{i=1}^n \psi(k_i+\alpha+1) = 0, ]
which can be solved numerically for the MLE.
def yulesimon_loglik(alpha: float, data: np.ndarray) -> float:
'''Log-likelihood for i.i.d. sample from Yule–Simon(alpha).'''
alpha = validate_alpha(alpha)
data = np.asarray(data)
if np.any(data < 1) or np.any(np.floor(data) != data):
raise ValueError("All observations must be integers >= 1")
n = data.size
return (
n * np.log(alpha)
+ n * special.gammaln(alpha + 1.0)
+ np.sum(special.gammaln(data))
- np.sum(special.gammaln(data + alpha + 1.0))
)
def yulesimon_score(alpha: float, data: np.ndarray) -> float:
'''Derivative of log-likelihood w.r.t. alpha.'''
alpha = validate_alpha(alpha)
data = np.asarray(data)
n = data.size
return float(
n / alpha
+ n * special.digamma(alpha + 1.0)
- np.sum(special.digamma(data + alpha + 1.0))
)
# Fit alpha by MLE on simulated data
alpha_true = 2.5
n = 5000
data = stats.yulesimon.rvs(alpha_true, size=n, random_state=rng)
res = optimize.minimize_scalar(
lambda a: -yulesimon_loglik(a, data), bounds=(1e-6, 50.0), method="bounded"
)
alpha_hat = float(res.x)
fit_res = stats.fit(
stats.yulesimon,
data,
bounds={"alpha": (1e-6, 50.0), "loc": (0, 0)},
guess={"alpha": alpha_hat, "loc": 0},
)
print("alpha_true:", alpha_true)
print("alpha_hat (MLE via minimize_scalar):", alpha_hat)
print("score(alpha_hat):", yulesimon_score(alpha_hat, data))
print("SciPy stats.fit params:", fit_res.params)
alpha_true: 2.5
alpha_hat (MLE via minimize_scalar): 2.466878956783674
score(alpha_hat): -0.0001069346399162896
SciPy stats.fit params: FitParams(alpha=2.4668784592208555, loc=0.0)
7) Sampling & Simulation (NumPy-only)#
Using the Beta–geometric mixture gives a simple sampler.
Draw (U \sim \mathrm{Uniform}(0,1)) and set (P = U^{1/\alpha}). This is inverse transform sampling: the (\mathrm{Beta}(\alpha,1)) CDF is (F(p)=p^\alpha), so (P = U^{1/\alpha}) has the (\mathrm{Beta}(\alpha,1)) distribution.
Given (P=p), draw (K\mid P=p \sim \mathrm{Geometric}(p)) on ({1,2,\dots}).
For the geometric step we can use inverse transform sampling:
[ K = 1 + \left\lfloor \frac{\log V}{\log(1-p)} \right\rfloor,\qquad V\sim \mathrm{Uniform}(0,1), ]
(where (\log(1-p)<0)).
def yulesimon_rvs_numpy(alpha: float, size: int, rng: np.random.Generator | None = None):
'''Sample from Yule–Simon(alpha) using only NumPy + standard library math.
Returns samples on {1, 2, ...}.
'''
alpha = validate_alpha(alpha)
if rng is None:
rng = np.random.default_rng()
size = int(size)
# Step 1: P ~ Beta(alpha, 1) via inverse transform: P = U^(1/alpha)
u = rng.random(size)
u = np.clip(u, np.finfo(float).tiny, 1.0)
p = u ** (1.0 / alpha)
# Step 2: K | P=p ~ Geometric(p) on {1,2,...}
v = rng.random(size)
v = np.clip(v, np.finfo(float).tiny, 1.0)
# Use log1p(-p) for stability when p is small.
k = 1 + np.floor(np.log(v) / np.log1p(-p))
return k.astype(int)
alpha = 2.5
samples = yulesimon_rvs_numpy(alpha, size=200_000, rng=rng)
print("min/max:", int(samples.min()), int(samples.max()))
print("sample mean:", float(samples.mean()))
print("sample var:", float(samples.var()))
print("theoretical:", yulesimon_theoretical_stats(alpha))
min/max: 1 195
sample mean: 1.661185
sample var: 4.414319395774998
theoretical: {'alpha': 2.5, 'mean': 1.6666666666666667, 'var': 5.555555555555555, 'skew': inf, 'kurt': inf, 'excess_kurt': inf}
8) Visualization#
We’ll visualize:
the PMF (including a log–log view for tail behavior)
the CDF / survival function
Monte Carlo samples compared to the theoretical distribution
alpha = 2.5
# PMF/CDF for small k
k_small = np.arange(1, 31)
pmf = stats.yulesimon.pmf(k_small, alpha)
cdf = stats.yulesimon.cdf(k_small, alpha)
fig_pmf = go.Figure()
fig_pmf.add_trace(go.Bar(x=k_small, y=pmf, name="PMF"))
fig_pmf.update_layout(
title=f"Yule–Simon PMF (alpha={alpha})",
xaxis_title="k",
yaxis_title="P(K=k)",
)
fig_pmf.show()
fig_cdf = go.Figure()
fig_cdf.add_trace(go.Scatter(x=k_small, y=cdf, mode="lines+markers", name="CDF"))
fig_cdf.update_layout(
title=f"Yule–Simon CDF (alpha={alpha})",
xaxis_title="k",
yaxis_title="P(K ≤ k)",
)
fig_cdf.show()
# Monte Carlo vs theory (empirical CDF and survival)
mc = stats.yulesimon.rvs(alpha, size=80_000, random_state=rng)
mc_sorted = np.sort(mc)
k_tail = np.arange(1, 200)
emp_cdf = np.searchsorted(mc_sorted, k_tail, side="right") / mc_sorted.size
emp_sf = 1.0 - emp_cdf
theo_sf = stats.yulesimon.sf(k_tail, alpha)
fig_tail = go.Figure()
fig_tail.add_trace(go.Scatter(x=k_tail, y=theo_sf, mode="lines", name="theory sf"))
fig_tail.add_trace(
go.Scatter(x=k_tail, y=emp_sf, mode="markers", name="empirical sf", opacity=0.6)
)
fig_tail.update_layout(
title=f"Survival function on log–log scale (alpha={alpha})",
xaxis_title="k",
yaxis_title="P(K > k)",
)
fig_tail.update_xaxes(type="log")
fig_tail.update_yaxes(type="log")
fig_tail.show()
9) SciPy Integration#
SciPy implements this distribution as scipy.stats.yulesimon with shape parameter alpha and optional loc.
Common methods:
`stats.yulesimon.pmf(k, alpha, loc=0)`
`stats.yulesimon.cdf(k, alpha, loc=0)`
`stats.yulesimon.rvs(alpha, loc=0, size=..., random_state=...)`
For fitting, use scipy.stats.fit (note: the distribution object itself does not have a .fit method):
stats.fit(stats.yulesimon, data, bounds=..., method='mle')
alpha = 2.5
k = np.arange(1, 6)
print("pmf:", stats.yulesimon.pmf(k, alpha))
print("cdf:", stats.yulesimon.cdf(k, alpha))
print("rvs:", stats.yulesimon.rvs(alpha, size=10, random_state=rng))
# Fit alpha (and optionally loc). Here we fix loc=0.
data = stats.yulesimon.rvs(alpha, size=2000, random_state=rng)
fit_res = stats.fit(
stats.yulesimon, data, bounds={"alpha": (1e-6, 50.0), "loc": (0, 0)}
)
print("fit params:", fit_res.params)
print("negative log-likelihood at fit:", float(fit_res.nllf()))
pmf: [0.714286 0.15873 0.05772 0.02664 0.014208]
cdf: [0.714286 0.873016 0.930736 0.957376 0.971584]
rvs: [1 2 1 1 1 4 1 7 1 2]
fit params: FitParams(alpha=2.5872600619798414, loc=0.0)
negative log-likelihood at fit: 1991.4834873154202
10) Statistical Use Cases#
10.1 Hypothesis testing / goodness-of-fit#
Goodness-of-fit: fit (\alpha) by MLE and compare observed frequencies to expected frequencies (e.g., a chi-square test with sensible binning, or a parametric bootstrap).
Model comparison: compare log-likelihoods / information criteria between heavy-tailed candidates (Yule–Simon, Zipf, negative binomial, etc.).
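As a sketch of model comparison (sizes and seed are illustrative), fit both Yule–Simon and Zipf by MLE and compare AIC; each model has one free parameter, so this reduces to a log-likelihood comparison:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)
data = stats.yulesimon.rvs(2.5, size=3000, random_state=rng)

def neg_ll(dist, shape, data):
    """Negative log-likelihood of a one-shape-parameter discrete distribution."""
    return -float(np.sum(dist.logpmf(data, shape)))

fits = {}
for name, dist, bounds in [
    ("yulesimon", stats.yulesimon, (1e-6, 50.0)),
    ("zipf", stats.zipf, (1.0 + 1e-6, 50.0)),  # Zipf requires a > 1
]:
    res = optimize.minimize_scalar(
        lambda s: neg_ll(dist, s, data), bounds=bounds, method="bounded"
    )
    # AIC = 2k - 2*loglik with k = 1 parameter per model.
    fits[name] = {"shape": float(res.x), "aic": 2.0 * res.fun + 2.0}

for name, f in fits.items():
    print(name, f)
```

Since the data were simulated from a Yule–Simon model, its AIC should come out lower than Zipf's here.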
10.2 Bayesian modeling#
In Bayesian settings, (\alpha) can be given a prior (e.g., Gamma prior on (\alpha)) and inferred from data using:
grid approximation (1D problems)
MCMC / HMC (more complex models)
variational inference (large-scale)
The likelihood is easy to evaluate in log space, which is helpful for posterior computation.
10.3 Generative modeling#
Yule–Simon shows up as the stationary/limiting distribution of preferential attachment mechanisms. A classic example is Simon’s model: with probability (p_{\text{new}}) create a new category; otherwise, copy an existing category proportional to its current count. The induced category-size distribution has a Yule–Simon-like heavy tail.
# 10.1 Goodness-of-fit (illustration): chi-square with a tail bin
alpha_true = 2.5
n = 5000
x = stats.yulesimon.rvs(alpha_true, size=n, random_state=rng)
# Fit alpha by MLE (loc fixed at 0)
res = optimize.minimize_scalar(
lambda a: -yulesimon_loglik(a, x), bounds=(1e-6, 50.0), method="bounded"
)
alpha_hat = float(res.x)
k_max = 12 # bins: 1..k_max-1, plus tail (>=k_max)
obs = np.array([(x == k).sum() for k in range(1, k_max)] + [(x >= k_max).sum()])
probs = np.append(
stats.yulesimon.pmf(np.arange(1, k_max), alpha_hat),
stats.yulesimon.sf(k_max - 1, alpha_hat),
)
exp = n * probs
chi2, p_value = stats.chisquare(f_obs=obs, f_exp=exp)
print("alpha_true:", alpha_true)
print("alpha_hat:", alpha_hat)
print("obs counts:", obs)
print("exp counts:", np.round(exp, 2))
print("chi2:", float(chi2))
print("p-value (naive df; alpha was estimated):", float(p_value))
alpha_true: 2.5
alpha_hat: 2.5961292376240657
obs counts: [3587 827 266 128 63 33 35 17 12 5 4 23]
exp counts: [3609.62 785.36 280.68 127.66 67.22 39.1 24.45 16.15 11.14
7.96 5.86 24.81]
chi2: 10.821869911711197
p-value (naive df; alpha was estimated): 0.4583050042312996
# 10.2 Bayesian modeling (illustration): 1D grid posterior for alpha
x = stats.yulesimon.rvs(2.5, size=1500, random_state=rng)
alpha_grid = np.linspace(0.2, 10.0, 500)
# Prior: Gamma(a=2, scale=2) (mean=4)
log_prior = stats.gamma.logpdf(alpha_grid, a=2.0, scale=2.0)
log_like = np.array([yulesimon_loglik(a, x) for a in alpha_grid])
log_post = log_prior + log_like
log_post -= np.max(log_post)
post_unnorm = np.exp(log_post)
# Normalize (continuous approximation)
post = post_unnorm / np.trapz(post_unnorm, alpha_grid)
alpha_map = float(alpha_grid[np.argmax(post)])
alpha_mean = float(np.trapz(alpha_grid * post, alpha_grid))
print("posterior MAP:", alpha_map)
print("posterior mean:", alpha_mean)
fig = go.Figure()
fig.add_trace(go.Scatter(x=alpha_grid, y=post, mode="lines", name="posterior"))
fig.add_vline(x=alpha_map, line_dash="dash", line_color="black", annotation_text="MAP")
fig.update_layout(
title="Posterior over α (grid approximation)",
xaxis_title="alpha",
yaxis_title="density",
)
fig.show()
posterior MAP: 2.61563126252505
posterior mean: 2.6291776944815863
# 10.3 Generative modeling: Simon's preferential-attachment process
def simon_process_counts(T: int, p_new: float, rng: np.random.Generator) -> np.ndarray:
'''Simon's model.
At each step t:
- with probability p_new: create a new category
- otherwise: pick a previous token uniformly and copy its category
Copying a previous token chooses categories proportional to their current counts.
Returns the final category counts.
'''
if not (0.0 < p_new < 1.0):
raise ValueError("p_new must be in (0,1)")
owners = np.empty(T, dtype=int)
counts = np.zeros(T, dtype=int)
owners[0] = 0
counts[0] = 1
n_types = 1
for t in range(1, T):
if rng.random() < p_new:
type_id = n_types
n_types += 1
counts[type_id] = 1
else:
type_id = owners[rng.integers(t)]
counts[type_id] += 1
owners[t] = type_id
return counts[:n_types]
T = 50_000
p_new = 0.2
alpha_theory = 1.0 / (1.0 - p_new)
counts = simon_process_counts(T=T, p_new=p_new, rng=rng)
k = np.arange(1, 60)
emp_pmf = np.array([(counts == kk).mean() for kk in k])
theo_pmf = stats.yulesimon.pmf(k, alpha_theory)
fig = go.Figure()
fig.add_trace(
go.Scatter(x=k, y=emp_pmf, mode="markers", name="Simon empirical", opacity=0.7)
)
fig.add_trace(go.Scatter(x=k, y=theo_pmf, mode="lines", name=f"Yule–Simon α={alpha_theory:.3f}"))
fig.update_layout(
title="Category-size distribution from Simon's model vs Yule–Simon",
xaxis_title="k (category size)",
yaxis_title="fraction of categories of size k",
)
fig.update_xaxes(type="log")
fig.update_yaxes(type="log")
fig.show()
print("T:", T)
print("number of categories:", counts.size)
print("largest category size:", int(counts.max()))
print("alpha_theory (from p_new):", alpha_theory)
T: 50000
number of categories: 9973
largest category size: 8061
alpha_theory (from p_new): 1.25
11) Pitfalls#
Invalid parameters: (\alpha) must be positive. In SciPy, passing non-positive values will produce errors or `nan`.
Infinite moments: remember the thresholds:
(\alpha\le 1): mean is infinite
(\alpha\le 2): variance is infinite
(\alpha\le 3): skewness is infinite
(\alpha\le 4): kurtosis is infinite
In finite samples you will still get finite empirical moments, but they can be extremely unstable.
Numerical issues:
For large (k), the PMF is tiny; compute in log space (`logpmf`) when doing likelihood work.
For large (k), the CDF can be extremely close to 1; using `sf`/`logsf` is usually more stable.
SciPy’s entropy and higher moments may warn about slow convergence for heavy tails.
Sampling extremes: because of the heavy tail, maximum values in a sample can be surprisingly large even when the mean is finite.
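For example (illustrative size, fixed seed), even when the mean is finite the sample maximum dwarfs the sample mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# alpha = 1.2: mean = alpha/(alpha-1) = 6, but variance is infinite.
samples = stats.yulesimon.rvs(1.2, size=100_000, random_state=rng)
print("sample mean:", samples.mean())
print("sample max :", int(samples.max()))  # far larger than the mean
```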
12) Summary#
`yulesimon` is a discrete, heavy-tailed distribution on ({1,2,\dots}) with shape parameter (\alpha>0).
The PMF is (p(k)=\alpha B(k,\alpha+1)) and the CDF has a simple survival-function form (\mathbb{P}(K>k)=\alpha B(k+1,\alpha)).
The (m)-th moment exists iff (\alpha>m); in particular, mean requires (\alpha>1) and variance requires (\alpha>2).
A clean sampler comes from the Beta–geometric mixture: (P\sim\mathrm{Beta}(\alpha,1)), then (K\mid P\sim\mathrm{Geometric}(P)).
SciPy’s `stats.yulesimon` provides PMF/CDF/SF/RVS, and `stats.fit` can estimate parameters by MLE.